feat(chat): Add durable agent continuation queue #470
Merged
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Keep queued Slack mailbox records pending until the Slack runtime handoff succeeds. Mark only processed records as injected after the handler returns. Complete successful Slack handlers even after the soft deadline has elapsed. This avoids duplicate queue nudges that can replay injected work. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Preserve runnable state across leased queue work, process pending Slack mailbox records before timeout resumes, and avoid replaying already-injected Slack work after recovery. Add replay protection for already-delivered Slack replies and a high-water timeout slice cap so pathological continuations fail instead of scheduling forever. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Preserve runnable conversation state when a leased worker completes after needsRun was marked during execution. This keeps continuation and late work recovery from waiting on heartbeat repair. Scope worker and heartbeat queue idempotency keys to a specific wake-up attempt so provider dedupe cannot suppress later legitimate recovery nudges. Move deterministic worker, lease, mailbox, and timeout-resume coverage into component tests and document the layer boundary. Refs GH-470 Co-Authored-By: Codex GPT-5 <noreply@openai.com>
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Treat app_home_opened view publishing as a best-effort side effect so transient Slack API failures do not make the Events API webhook return 500 and trigger repeated Slack retries. Add a Slack webhook integration test that drives a signed app_home_opened event through the real ingress path while views.publish fails. Refs GH-470 Co-Authored-By: Codex GPT-5 <noreply@openai.com>
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Derive restored Slack thread subscription context from the promoted batch route used for dispatch. Mixed queued batches now pass a mention-context thread into the mention handler instead of inheriting subscribed context from the latest message metadata. Add a component regression that persists a mention plus a subscribed follow-up, verifies the mixed mailbox routes, and checks the restored thread context observed by the runtime. Refs GH-470 Co-Authored-By: Codex GPT-5 <noreply@openai.com>
505841e to
2e8846d
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Update the rebased lockfile so Vercel Queue peer resolution covers the dashboard and example workspace entries now present on main. Refs GH-470 Co-Authored-By: Codex GPT-5 <noreply@openai.com>
dcramer
added a commit
that referenced
this pull request
Jun 2, 2026
Move Slack event callback processing behind waitUntil so Slack receives a fast acknowledgement while durable mailbox work still runs through the existing handoff path. Give Vercel Queue visibility a buffer beyond the function timeout to avoid redelivery racing host teardown. Refs GH-470 Co-Authored-By: Codex GPT-5 <noreply@openai.com>
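The visibility buffer in the commit above comes down to simple arithmetic. The helper below is a minimal sketch, not code from this PR: the function name and the example numbers are assumptions, and the actual Vercel Queue configuration key is not shown in this conversation.

```typescript
// Hypothetical helper: derive a queue visibility timeout that outlasts the
// function timeout, so a message is not redelivered while the original
// worker is still being torn down by the host.
function visibilityTimeoutSeconds(
  functionTimeoutSeconds: number,
  bufferSeconds: number,
): number {
  if (functionTimeoutSeconds <= 0 || bufferSeconds < 0) {
    throw new RangeError("timeout must be positive and buffer non-negative");
  }
  return functionTimeoutSeconds + bufferSeconds;
}
```

With an illustrative 300-second function timeout and a 30-second buffer, the message would stay invisible for 330 seconds.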
Add a component regression that sends a Slack message_changed request through the durable webhook and queued worker path. This keeps edited mention coverage on the new mailbox architecture, not only the legacy Chat SDK webhook path. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Use the active turn deadline budget in timeout errors and timeout telemetry. This keeps resumed turns with shorter host request deadlines from reporting the configured maximum instead of the operative timeout. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Treat tool results, not steering user messages, as the terminal assistant output boundary. This prevents mid-turn steering from truncating assistant text that belongs to the same finalized reply. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Apply Pi steering inside the durable injection callback so a steering failure rejects before mailbox records are marked injected. This keeps Slack follow-ups pending for a later worker instead of silently dropping them. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Thread durable worker yield checks into Pi safe boundaries so Slack turns pause before starting another model iteration when the worker soft deadline has elapsed. Carry host request deadlines into resumed Slack turns and skip heartbeat timeout-resume recovery when the persisted conversation no longer marks that session as the active turn. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Persist soft-yield boundaries with a distinct yield resume reason so routine worker continuation does not consume timeout resume slices. Bubble cooperative yield through Slack runtime handling so the generic conversation worker releases the lease and requeues the next slice at the durable worker boundary. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
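One way to picture the distinction between yield and timeout resumes, together with the timeout slice cap from an earlier commit, is the bookkeeping sketch below. The record shape, field names, and `recordPause` are hypothetical and only illustrate the idea; the PR's persisted session format is not shown here.

```typescript
// Hypothetical resume bookkeeping: only timeout resumes consume slices.
type ResumeReason = "timeout" | "yield";

interface SessionRecord {
  timeoutSlices: number; // timeout resumes used so far
  status: "awaiting_resume" | "failed";
  resumeReason?: ResumeReason;
}

function recordPause(
  session: SessionRecord,
  reason: ResumeReason,
  maxTimeoutSlices: number, // assumed cap; the PR does not state its value
): SessionRecord {
  if (reason === "yield") {
    // Cooperative yield: routine continuation, slice counter untouched.
    return { ...session, status: "awaiting_resume", resumeReason: "yield" };
  }
  const used = session.timeoutSlices + 1;
  if (used > maxTimeoutSlices) {
    // Cap reached: record a terminal failure instead of scheduling forever.
    return { timeoutSlices: session.timeoutSlices, status: "failed" };
  }
  return { timeoutSlices: used, status: "awaiting_resume", resumeReason: "timeout" };
}
```

Under these assumptions a turn can yield any number of times, but it fails once a timeout would push it past the cap.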
Mark expired conversation leases runnable so recovered queue nudges can reach continuation scanning even when mailbox messages were already injected. Include cooperative yield records in stale continuation heartbeat recovery. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Treat a timeout or yield continuation summary without a valid resume request as a worker failure instead of completed idle work. This keeps the conversation runnable for queue retry or heartbeat repair instead of clearing needsRun after recovered idle work. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Make Slack resume startup report whether a continuation actually started so idle durable work can distinguish a real resume from a stale no-op. Terminalize invalid or skipped awaiting timeout/yield sessions before completing idle work, and cover both invalid metadata and stale active-turn mismatch with component tests. Co-Authored-By: GPT-5 Codex <codex@openai.com>
When a new Slack message arrives while a previous turn is awaiting resume, schedule the old continuation without marking the new message as replied. This keeps the follow-up available for steering or the next handled turn instead of silently treating it as answered. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Return a model-visible MCP tool error when a resumed slice asks for a tool name that is no longer present in the rebuilt provider catalog. Include recovery guidance so the agent can refresh the provider catalog before retrying. Fixes GH-492 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Only mark Slack mailbox records injected after the runtime has durably persisted the turn handoff. Preserve pending mailbox work when a handler only reschedules an awaiting continuation, while still marking already-handled early replies after their thread state is saved. Co-Authored-By: GPT-5 Codex <codex@openai.com>
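The ordering rule in this commit (persist the handoff first, mark records injected second) can be sketched as below. This is a simplified, synchronous illustration: `MailboxRecord`, `processBatch`, and the `handoff` callback are hypothetical names, and a real worker would be asynchronous and would likely rethrow the error rather than swallow it.

```typescript
// Hypothetical mailbox record; "injected" means the runtime has taken ownership.
type RecordState = "pending" | "injected";
interface MailboxRecord { id: string; state: RecordState }

// `handoff` stands in for the runtime call and returns the ids it durably persisted.
function processBatch(
  records: MailboxRecord[],
  handoff: (pending: MailboxRecord[]) => string[],
): MailboxRecord[] {
  const pending = records.filter((r) => r.state === "pending");
  let persisted: Set<string>;
  try {
    persisted = new Set(handoff(pending));
  } catch {
    // Handoff failed: nothing is marked, so a later worker can retry.
    return records;
  }
  // Only records the runtime confirmed are flipped to "injected".
  return records.map((r) =>
    persisted.has(r.id) ? { ...r, state: "injected" as const } : r,
  );
}
```

Because marking happens only after the handoff returns, a crash or rejection between the two steps leaves the records pending instead of silently dropping them.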
Prefer awaiting turn continuation recovery before routing pending Slack mailbox records. When a continuation starts, leave pending mail runnable so the queue re-drives it after the active turn finishes instead of looping through a no-handoff Slack handler path. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Sign Vercel conversation work queue payloads with JUNIOR_SECRET and verify the decoded callback payload before processing work. This keeps the public queue route from executing forged conversation work even outside Vercel trigger enforcement. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Keep failed continuation persistence from becoming a generated assistant error reply, while preserving the terminal timeout slice-cap record as failed state. Propagate Slack handoff lost-lease results through the conversation worker so downstream completion does not treat the run as successful. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Handle already-replied Slack deliveries before rescheduling an active continuation so duplicate events complete mailbox handoff without nudging the old turn again. Keep auth-pause persistence failure behavior aligned with the existing provider-error contract while preserving the continuation failure fixes for yield and timeout paths. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Worker errors already marked the conversation runnable before releasing its lease, but a recent enqueue marker could make heartbeat defer recovery. Send a fresh wake-up nudge for failed runner slices so runnable work is redelivered promptly. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Cooperative yields could snapshot Pi state before queued steering messages appeared in agent.state.messages. Keep the latest safe boundary candidate available so yielded, timed-out, and auth-paused resumes do not overwrite a longer steering-aware transcript. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Bound signed conversation-work callbacks to a small timestamp window so old queue payloads cannot be replayed indefinitely. When a Slack follow-up arrives during an active parked turn, keep auth pauses parked and fail malformed awaiting continuations before accepting new work. This prevents a fresh turn from replacing the durable session state. Co-Authored-By: GPT-5 Codex <codex@openai.com>
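A signed callback with a bounded timestamp window, as described in this commit and the earlier JUNIOR_SECRET commit, might look like the sketch below. The payload fields, the signed string layout, and the window size are assumptions for illustration; only the use of a shared secret and a timestamp bound comes from the commit messages.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical payload shape; field names are not taken from the PR.
interface SignedWork {
  conversationId: string;
  timestampMs: number;
  signature: string; // hex HMAC-SHA256 over "timestampMs.conversationId"
}

function sign(secret: string, conversationId: string, timestampMs: number): SignedWork {
  const signature = createHmac("sha256", secret)
    .update(`${timestampMs}.${conversationId}`)
    .digest("hex");
  return { conversationId, timestampMs, signature };
}

function verify(
  secret: string,
  work: SignedWork,
  nowMs: number,
  maxSkewMs: number, // assumed window; the PR does not state its size
): boolean {
  // Reject payloads outside the window so old messages cannot be replayed.
  if (Math.abs(nowMs - work.timestampMs) > maxSkewMs) return false;
  const expected = sign(secret, work.conversationId, work.timestampMs).signature;
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(work.signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Including the timestamp in the signed string means an attacker cannot refresh an old payload by editing the timestamp alone.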
Preserve rapid Slack follow-ups through durable conversation work and tighten timeout-resume scheduling. Add shared test adapters for queue, Slack webhook, Slack outbox, waitUntil, and signed resume requests so tests exercise real boundaries with less mocking. Document Django-inspired test adapter principles and stabilize slow Slack integration timeouts under clustered runs. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Replace automatic processing eyes with a completion check when Slack turns finish. Leave parked, skipped, and failed turns without completion so reactions match the lifecycle. Restore requester credential context when timeout resumes rebuild reply context after the rebase. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Report whether running turn checkpoints actually persist before committing durable Slack mailbox input. Propagate lost-input ownership errors instead of allowing a successful turn without mailbox commit. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Keep runnable conversation ids in the heartbeat recovery index when pruning overflow entries. Treat failed worker lease check-ins as lost ownership so in-flight work cannot complete after lease loss. Co-Authored-By: GPT-5 Codex <codex@openai.com>
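The lease check-in rule in this commit can be illustrated with a small in-memory table. This is a sketch under stated assumptions: `LeaseTable` and its methods are hypothetical, and a production store would need atomic compare-and-set operations rather than a plain `Map`.

```typescript
// Hypothetical in-memory lease table keyed by conversation id.
interface Lease { owner: string; expiresAtMs: number }

class LeaseTable {
  private leases = new Map<string, Lease>();

  // Acquire succeeds when no lease exists, the current lease has expired,
  // or the same owner re-acquires.
  acquire(id: string, owner: string, nowMs: number, ttlMs: number): boolean {
    const current = this.leases.get(id);
    if (current && current.expiresAtMs > nowMs && current.owner !== owner) return false;
    this.leases.set(id, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }

  // Check-in extends the lease only for the current, unexpired owner.
  // A false result means ownership was lost and the worker must stop
  // instead of completing its in-flight work.
  checkIn(id: string, owner: string, nowMs: number, ttlMs: number): boolean {
    const current = this.leases.get(id);
    if (!current || current.owner !== owner || current.expiresAtMs <= nowMs) return false;
    current.expiresAtMs = nowMs + ttlMs;
    return true;
  }
}
```

A worker that sees `checkIn` return false would treat the run as lost ownership, which matches the behavior this commit describes for failed check-ins.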
Keep terminal timeout resume failures on the error path after the session record reaches the slice cap. This prevents Slack delivery from treating an exhausted turn as a successful assistant reply containing an Error-prefixed message. Add a regression that seeds the durable session at the timeout cap and verifies the runtime throws while persisting the failed terminal record. Co-Authored-By: GPT-5 Codex <codex@openai.com>
Skip duplicate inbound retries when the conversation already has a fresh queue marker. Repair stale or missing markers with a fresh idempotency key so failed handoffs still recover promptly. Add fake queue attempt introspection and Slack retry coverage so duplicate sends are visible in tests. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
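The marker check in this commit, combined with the attempt-scoped idempotency keys from an earlier commit, could be sketched like this. The marker shape, the key format, and the freshness window are hypothetical; the PR's actual key scheme is not shown in this conversation.

```typescript
// Hypothetical marker recorded whenever a wake-up is enqueued.
interface QueueMarker { enqueuedAtMs: number; attempt: number }

type Nudge =
  | { action: "skip" }
  | { action: "enqueue"; idempotencyKey: string; marker: QueueMarker };

function planNudge(
  conversationId: string,
  marker: QueueMarker | undefined,
  nowMs: number,
  freshForMs: number, // assumed freshness window
): Nudge {
  if (marker && nowMs - marker.enqueuedAtMs < freshForMs) {
    // A recent wake-up is already in flight: skip the duplicate send.
    return { action: "skip" };
  }
  // Missing or stale marker: bump the attempt so the key differs from any
  // earlier send and provider dedupe cannot suppress the recovery nudge.
  const attempt = (marker?.attempt ?? 0) + 1;
  return {
    action: "enqueue",
    idempotencyKey: `${conversationId}:wake:${attempt}`,
    marker: { enqueuedAtMs: nowMs, attempt },
  };
}
```

Scoping the key to the attempt keeps retries of the same inbound event deduplicated while still letting a later repair go through.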
Drop unused imports left after the latest rebase conflict resolution so lint passes with warnings denied. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
When a Slack turn only reschedules an awaiting continuation, persist and commit the input hooks before returning. This lets durable mailbox workers mark the inbound row injected without sending a visible reply. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 91b0e39.
Mark lost-lease worker exits as runnable before releasing the conversation lease so queued work can recover immediately instead of waiting for lease TTL repair. Refs GH-470 Co-Authored-By: GPT-5 Codex <codex@openai.com>

Move production Slack turn execution to a durable conversation mailbox with Vercel Queue wake-ups. This lets timed-out or vanished serverless workers recover through heartbeat-driven continuation instead of waiting for another user message.
Durable Execution
Adds conversation work state, leases, check-ins, queue callbacks, and heartbeat repair for expired leases or stranded mailbox work. The queue callback is exposed as /api/internal/agent/continue and carries only conversationId.
Slack Cutover
Slack webhooks now normalize inbound events into durable mailbox records and wake the worker. Routine timeout/cooperative continuation no longer posts visible thread notices; progress remains owned by assistant status and reportProgress.
Specs And Verification
Documents the task execution contract, updates resumability and Slack delivery specs, refreshes Pi agent integration references, and adds integration coverage for mailbox execution, heartbeat recovery, Slack ingress, timeout continuation, and Vercel queue config.